{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PNM9wqstBSY_"
   },
   "source": [
    "# Policy\n",
    "In reinforcement learning, the agent interacts with environments to improve itself. In this tutorial we will concentrate on the agent part. In Tianshou, both the agent and the core DRL algorithm are implemented in the Policy module. Tianshou provides more than 20 Policy modules, each representing one DRL algorithm. See supported algorithms [here](https://tianshou.readthedocs.io/en/master/03_api/policy/index.html).\n",
    "\n",
    "<div align=center>\n",
    "<img src=\"https://tianshou.readthedocs.io/en/master/_images/rl-loop.jpg\", title=\"The agents interacting with the environment\">\n",
    "\n",
    "<a> The agents interacting with the environment </a>\n",
    "</div>\n",
    "\n",
    "All Policy modules inherit from a BasePolicy Class and share the same interface."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZqdHYdoJJS51"
   },
   "source": [
    "## Creating your own Policy\n",
    "We will use the simple PGPolicy, also called REINFORCE algorithm Policy, to show the implementation of a Policy Module. The Policy we implement here will be a scaled-down version of [PGPolicy](https://tianshou.readthedocs.io/en/master/03_api/policy/modelfree/pg.html) in Tianshou."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "editable": true,
    "id": "cDlSjASbJmy-",
    "slideshow": {
     "slide_type": ""
    },
    "tags": [
     "hide-cell",
     "remove-output"
    ]
   },
   "outputs": [],
   "source": [
    "%%capture\n",
    "\n",
    "from typing import Any, cast\n",
    "\n",
    "import gymnasium as gym\n",
    "import numpy as np\n",
    "import torch\n",
    "\n",
    "from tianshou.data import (\n",
    "    Batch,\n",
    "    ReplayBuffer,\n",
    "    SequenceSummaryStats,\n",
    "    to_torch,\n",
    "    to_torch_as,\n",
    ")\n",
    "from tianshou.data.batch import BatchProtocol\n",
    "from tianshou.data.types import (\n",
    "    BatchWithReturnsProtocol,\n",
    "    DistBatchProtocol,\n",
    "    ObsBatchProtocol,\n",
    "    RolloutBatchProtocol,\n",
    ")\n",
    "from tianshou.policy import BasePolicy\n",
    "from tianshou.policy.modelfree.pg import (\n",
    "    PGTrainingStats,\n",
    "    TDistFnDiscrOrCont,\n",
    "    TPGTrainingStats,\n",
    ")\n",
    "from tianshou.utils import RunningMeanStd\n",
    "from tianshou.utils.net.common import Net\n",
    "from tianshou.utils.net.discrete import Actor\n",
    "from tianshou.utils.torch_utils import policy_within_training_step, torch_train_mode"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Protocols\n",
    "Note: as we learned in tutorial [L1_batch](https://tianshou.readthedocs.io/en/master/02_notebooks/L1_Batch.html#), Tianshou uses `Batch` to store data. `Batch` is a dataclass that can store any data you want. In order to have more control about what kind of batch data is expected and produced in each processing step we use protocols.  \n",
    "For example, `BatchWithReturnsProtocol` specifies that the batch should have fields `obs`, `act`, `rew`, `done`, `obs_next`, `info` and `returns`. This is not only for type checking, but also for IDE support.\n",
    "To learn more about protocols, please refer to the official documentation ([PEP 544](https://www.python.org/dev/peps/pep-0544/)) or to mypy documentation ([Protocols](https://mypy.readthedocs.io/en/stable/protocols.html)).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Initialization\n",
    "Firstly we create the `PGPolicy` by inheriting from `BasePolicy` in Tianshou."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "class PGPolicy(BasePolicy):\n",
    "    \"\"\"Implementation of REINFORCE algorithm.\"\"\"\n",
    "\n",
    "    def __init__(self) -> None:\n",
    "        super().__init__(\n",
    "            action_space=action_space,\n",
    "            observation_space=observation_space,\n",
    "        )\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qc1RnIBbLCDN"
   },
   "source": [
    "The Policy Module mainly does two things:\n",
    "\n",
    "1.   `policy.forward()` receives observation and other information (stored in a Batch) from the environment and returns a new Batch containing the next action and other information.\n",
    "2.   `policy.update()` receives training data sampled from the replay buffer and updates the policy network. It returns a dataclass containing logging details.\n",
    "\n",
    "<div align=center>\n",
    "<img src=\"https://tianshou.readthedocs.io/en/master/_images/pipeline.png\">\n",
    "\n",
    "<a> policy.forward() and policy.update() </a>\n",
    "</div>\n",
    "\n",
    "We also need to take care of the following things:\n",
    "\n",
    "1.   Since Tianshou is a **Deep** RL libraries, there should be a policy network and a Torch optimizer in our Policy Module.\n",
    "2.   In Tianshou's BasePolicy, `Policy.update()` first calls `Policy.process_fn()` to \n",
    "preprocess training data and computes quantities like episodic returns (gradient free), \n",
    "then it will call `Policy.learn()` to perform the back-propagation.\n",
    "3. Each Policy is accompanied by a dedicated implementation of `TrainingStats` , which store details of each training step.\n",
    "\n",
    "This is how we get the implementation below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "class PGPolicy(BasePolicy[TPGTrainingStats]):\n",
    "    \"\"\"Implementation of REINFORCE algorithm.\"\"\"\n",
    "\n",
    "    def __init__(\n",
    "        self, \n",
    "        actor: torch.nn.Module, \n",
    "        optim: torch.optim.Optimizer, \n",
    "        action_space: gym.Space\n",
    "    ):\n",
    "        super().__init__(\n",
    "            action_space=action_space,\n",
    "            observation_space=observation_space\n",
    "        )\n",
    "        self.actor = model\n",
    "        self.optim = optim\n",
    "\n",
    "    def process_fn(\n",
    "        self, \n",
    "        batch: RolloutBatchProtocol, \n",
    "        buffer: ReplayBuffer, \n",
    "        indices: np.ndarray\n",
    "    ) -> BatchWithReturnsProtocol:\n",
    "        \"\"\"Compute the discounted returns for each transition.\"\"\"\n",
    "        return batch\n",
    "\n",
    "    def forward(\n",
    "        self, \n",
    "        batch: ObsBatchProtocol,\n",
    "        state: dict | BatchProtocol | np.ndarray | None = None,\n",
    "        **kwargs: Any\n",
    "    ) -> DistBatchProtocol:\n",
    "        \"\"\"Compute action over the given batch data.\"\"\"\n",
    "        act = None\n",
    "        return Batch(act=act)\n",
    "\n",
    "    def learn(\n",
    "        self,\n",
    "        batch: BatchWithReturnsProtocol, \n",
    "        batch_size: int | None, \n",
    "        repeat: int,\n",
    "        *args: Any,\n",
    "        **kwargs: Any,\n",
    "    ) -> TPGTrainingStats:\n",
    "        \"\"\"Perform the back-propagation.\"\"\"\n",
    "        return PGTrainingStats(loss=loss_summary_stat)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tjtqjt8WRY5e"
   },
   "source": [
    "### Policy.forward()\n",
    "According to the equation of REINFORCE algorithm in Spinning Up's [documentation](https://spinningup.openai.com/en/latest/algorithms/vpg.html), we need to map the observation to an action distribution in action space using the neural network (`self.actor`).\n",
    "\n",
    "<div align=center>\n",
    "<img src=\"https://spinningup.openai.com/en/latest/_images/math/3d29a18c0f98b1cdb656ecdf261ee37ffe8bb74b.svg\">\n",
    "</div>\n",
    "\n",
    "Let us suppose the action space is discrete, and the distribution is a simple categorical distribution."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "def forward(\n",
    "    self,\n",
    "    batch: ObsBatchProtocol,\n",
    "    state: dict | BatchProtocol | np.ndarray | None = None,\n",
    "    **kwargs: Any,\n",
    ") -> DistBatchProtocol:\n",
    "    \"\"\"Compute action over the given batch data.\"\"\"\n",
    "    logits, hidden = self.actor(batch.obs, state=state)\n",
    "    dist = self.dist_fn(logits)\n",
    "    act = dist.sample()\n",
    "    result = Batch(logits=logits, act=act, state=hidden, dist=dist)\n",
    "    return result\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CultfOeuTx2V"
   },
   "source": [
    "### Policy.process_fn()\n",
    "Now that we have defined our actor, if given training data we can set up a loss function and optimize our neural network. However, before that, we must first calculate episodic returns for every step in our training data to construct the REINFORCE loss function.\n",
    "\n",
    "Calculating episodic return is not hard, given `ReplayBuffer.next()` allows us to access every reward to go in an episode. A more convenient way would be to simply use the built-in method `BasePolicy.compute_episodic_return()` inherited from BasePolicy.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "def process_fn(\n",
    "    self,\n",
    "    batch: RolloutBatchProtocol,\n",
    "    buffer: ReplayBuffer,\n",
    "    indices: np.ndarray,\n",
    ") -> BatchWithReturnsProtocol:\n",
    "    \"\"\"Compute the discounted returns for each transition.\"\"\"\n",
    "    v_s_ = np.full(indices.shape, self.ret_rms.mean)\n",
    "    returns, _ = self.compute_episodic_return(batch, buffer, indices, v_s_=v_s_, gamma=0.99, gae_lambda=1.0)\n",
    "    batch.returns = returns\n",
    "    return batch\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XA8OF4GnWWr5"
   },
   "source": [
    "`BasePolicy.compute_episodic_return()` could also be used to compute [GAE](https://arxiv.org/abs/1506.02438). Another similar method is `BasePolicy.compute_nstep_return()`. Check the [source code](https://github.com/thu-ml/tianshou/blob/6fc68578127387522424460790cbcb32a2bd43c4/tianshou/policy/base.py#L304) for more details."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7UsdzNaOXPpC"
   },
   "source": [
    "### Policy.learn()\n",
    "Data batch returned by `Policy.process_fn()` will flow into `Policy.learn()`. Finally,\n",
    "we can construct our loss function and perform the back-propagation. The method \n",
    "should look something like this:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "def learn(\n",
    "    self,\n",
    "    batch: BatchWithReturnsProtocol,\n",
    "    batch_size: int | None,\n",
    "    repeat: int,\n",
    "    *args: Any,\n",
    "    **kwargs: Any,\n",
    ") -> TPGTrainingStats:\n",
    "    \"\"\"Perform the back-propagation.\"\"\"\n",
    "    losses = []\n",
    "    split_batch_size = batch_size or -1\n",
    "    for _ in range(repeat):\n",
    "        for minibatch in batch.split(split_batch_size, merge_last=True):\n",
    "            self.optim.zero_grad()\n",
    "            result = self(minibatch)\n",
    "            dist = result.dist\n",
    "            act = to_torch_as(minibatch.act, result.act)\n",
    "            ret = to_torch(minibatch.returns, torch.float, result.act.device)\n",
    "            log_prob = dist.log_prob(act).reshape(len(ret), -1).transpose(0, 1)\n",
    "            loss = -(log_prob * ret).mean()\n",
    "            loss.backward()\n",
    "            self.optim.step()\n",
    "            losses.append(loss.item())\n",
    "    loss_summary_stat = SequenceSummaryStats.from_sequence(losses)\n",
    "\n",
    "    return PGTrainingStats(loss=loss_summary_stat)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1BtuV2W0YJTi"
   },
   "source": [
    "## Implementation\n",
    "Now we can assemble the methods and form a PGPolicy. The outputs of\n",
    "`learn` will be collected to a dedicated dataclass."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class PGPolicy(BasePolicy[TPGTrainingStats]):\n",
    "    \"\"\"Implementation of REINFORCE algorithm.\"\"\"\n",
    "\n",
    "    def __init__(\n",
    "        self,\n",
    "        *,\n",
    "        actor: torch.nn.Module,\n",
    "        optim: torch.optim.Optimizer,\n",
    "        dist_fn: TDistFnDiscrOrCont,\n",
    "        action_space: gym.Space,\n",
    "        discount_factor: float = 0.99,\n",
    "        observation_space: gym.Space | None = None,\n",
    "    ) -> None:\n",
    "        super().__init__(\n",
    "            action_space=action_space,\n",
    "            observation_space=observation_space,\n",
    "        )\n",
    "        self.actor = actor\n",
    "        self.optim = optim\n",
    "        self.dist_fn = dist_fn\n",
    "        assert 0.0 <= discount_factor <= 1.0, \"discount factor should be in [0, 1]\"\n",
    "        self.gamma = discount_factor\n",
    "        self.ret_rms = RunningMeanStd()\n",
    "\n",
    "    def process_fn(\n",
    "        self,\n",
    "        batch: RolloutBatchProtocol,\n",
    "        buffer: ReplayBuffer,\n",
    "        indices: np.ndarray,\n",
    "    ) -> BatchWithReturnsProtocol:\n",
    "        \"\"\"Compute the discounted returns (Monte Carlo estimates) for each transition.\n",
    "\n",
    "        They are added to the batch under the field `returns`.\n",
    "        Note: this function will modify the input batch!\n",
    "        \"\"\"\n",
    "        v_s_ = np.full(indices.shape, self.ret_rms.mean)\n",
    "        # use a function inherited from BasePolicy to compute returns\n",
    "        # gae_lambda = 1.0 means we use Monte Carlo estimate\n",
    "        batch.returns, _ = self.compute_episodic_return(\n",
    "            batch,\n",
    "            buffer,\n",
    "            indices,\n",
    "            v_s_=v_s_,\n",
    "            gamma=self.gamma,\n",
    "            gae_lambda=1.0,\n",
    "        )\n",
    "        batch: BatchWithReturnsProtocol\n",
    "        return batch\n",
    "\n",
    "    def forward(\n",
    "        self,\n",
    "        batch: ObsBatchProtocol,\n",
    "        state: dict | BatchProtocol | np.ndarray | None = None,\n",
    "        **kwargs: Any,\n",
    "    ) -> DistBatchProtocol:\n",
    "        \"\"\"Compute action over the given batch data by applying the actor.\n",
    "\n",
    "        Will sample from the dist_fn, if appropriate.\n",
    "        Returns a new object representing the processed batch data\n",
    "        (contrary to other methods that modify the input batch inplace).\n",
    "        \"\"\"\n",
    "        logits, hidden = self.actor(batch.obs, state=state)\n",
    "\n",
    "        if isinstance(logits, tuple):\n",
    "            dist = self.dist_fn(*logits)\n",
    "        else:\n",
    "            dist = self.dist_fn(logits)\n",
    "\n",
    "        act = dist.sample()\n",
    "        return cast(DistBatchProtocol, Batch(logits=logits, act=act, state=hidden, dist=dist))\n",
    "\n",
    "    def learn(  # type: ignore\n",
    "        self,\n",
    "        batch: BatchWithReturnsProtocol,\n",
    "        batch_size: int | None,\n",
    "        repeat: int,\n",
    "        *args: Any,\n",
    "        **kwargs: Any,\n",
    "    ) -> TPGTrainingStats:\n",
    "        losses = []\n",
    "        split_batch_size = batch_size or -1\n",
    "        for _ in range(repeat):\n",
    "            for minibatch in batch.split(split_batch_size, merge_last=True):\n",
    "                self.optim.zero_grad()\n",
    "                result = self(minibatch)\n",
    "                dist = result.dist\n",
    "                act = to_torch_as(minibatch.act, result.act)\n",
    "                ret = to_torch(minibatch.returns, torch.float, result.act.device)\n",
    "                log_prob = dist.log_prob(act).reshape(len(ret), -1).transpose(0, 1)\n",
    "                loss = -(log_prob * ret).mean()\n",
    "                loss.backward()\n",
    "                self.optim.step()\n",
    "                losses.append(loss.item())\n",
    "\n",
    "        loss_summary_stat = SequenceSummaryStats.from_sequence(losses)\n",
    "\n",
    "        return PGTrainingStats(loss=loss_summary_stat)  # type: ignore[return-value]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xlPAbh0lKti8"
   },
   "source": [
    "## Use the policy\n",
    "Note that `BasePolicy` itself inherits from `torch.nn.Module`. As a result, you can consider all Policy modules as a Torch Module. They share similar APIs.\n",
    "\n",
    "Firstly we will initialize a new PGPolicy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "JkLFA9Z1KjuX"
   },
   "outputs": [],
   "source": [
    "state_shape = 4\n",
    "action_shape = 2\n",
    "# Usually taken from an env by using env.action_space\n",
    "action_space = gym.spaces.Box(low=-1, high=1, shape=(2,))\n",
    "net = Net(state_shape, hidden_sizes=[16, 16], device=\"cpu\")\n",
    "actor = Actor(net, action_shape, device=\"cpu\").to(\"cpu\")\n",
    "optim = torch.optim.Adam(actor.parameters(), lr=0.0003)\n",
    "dist_fn = torch.distributions.Categorical\n",
    "\n",
    "policy: BasePolicy\n",
    "policy = PGPolicy(actor=actor, optim=optim, dist_fn=dist_fn, action_space=action_space)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "LAo_0t2fekUD"
   },
   "source": [
    "PGPolicy shares same APIs with the Torch Module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "UiuTc8RhJiEi",
    "outputId": "9b5bc54c-6303-45f3-ba81-2216a44931e8"
   },
   "outputs": [],
   "source": [
    "print(policy)\n",
    "print(\"========================================\")\n",
    "for param in policy.parameters():\n",
    "    print(param.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-RCrsttYgAG-"
   },
   "source": [
    "### Making decision\n",
    "Given a batch of observations, the policy can return a batch of actions and other data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "0jkBb6AAgUla",
    "outputId": "37948844-cdd8-4567-9481-89453c80a157"
   },
   "outputs": [],
   "source": [
    "obs_batch = Batch(obs=np.ones(shape=(256, 4)))\n",
    "dist_batch = policy(obs_batch)  # forward() method is called\n",
    "print(\"Next action for each observation: \\n\", dist_batch.act)\n",
    "print(\"Dsitribution: \\n\", dist_batch.dist)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "swikhnuDfKep"
   },
   "source": [
    "### Save and Load models\n",
    "Naturally, Tianshou Policy can be saved and loaded like a normal Torch Network."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "tYOoWM_OJRnA"
   },
   "outputs": [],
   "source": [
    "torch.save(policy.state_dict(), \"policy.pth\")\n",
    "assert policy.load_state_dict(torch.load(\"policy.pth\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "gp8PzOYsg5z-"
   },
   "source": [
    "### Algorithm Updating\n",
    "We have to collect some data and save them in the ReplayBuffer before updating our agent(policy). Typically we use collector to collect data, but we leave this part till later when we have learned the Collector in Tianshou. For now we generate some **fake** data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XrrPxOUAYShR"
   },
   "source": [
    "#### Generating fake data\n",
    "Firstly, we need to \"pretend\" that we are using the \"Policy\" to collect data. We plan to collect 10 data so that we can update our algorithm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "a14CmzSfYh5C",
    "outputId": "aaf45a1f-5e21-4bc8-cbe3-8ce798258af0"
   },
   "outputs": [],
   "source": [
    "dummy_buffer = ReplayBuffer(size=10)\n",
    "print(dummy_buffer)\n",
    "print(f\"maxsize: {dummy_buffer.maxsize}, data length: {len(dummy_buffer)}\")\n",
    "env = gym.make(\"CartPole-v1\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8S94cV7yZITR"
   },
   "source": [
    "Now we are pretending to collect the first episode. The first episode ends at step 3 (perhaps because we are performing too badly)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "a_mtvbmBZbfs"
   },
   "outputs": [],
   "source": [
    "obs, info = env.reset()\n",
    "for i in range(3):\n",
    "    act = policy(Batch(obs=obs[np.newaxis, :])).act.item()\n",
    "    obs_next, rew, _, truncated, info = env.step(act)\n",
    "    # pretend ending at step 3\n",
    "    terminated = i == 2\n",
    "    info[\"id\"] = i\n",
    "    dummy_buffer.add(\n",
    "        Batch(\n",
    "            obs=obs,\n",
    "            act=act,\n",
    "            rew=rew,\n",
    "            terminated=terminated,\n",
    "            truncated=truncated,\n",
    "            obs_next=obs_next,\n",
    "            info=info,\n",
    "        ),\n",
    "    )\n",
    "    obs = obs_next"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(dummy_buffer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pkxq4gu9bGkt"
   },
   "source": [
    "Now we are pretending to collect the second episode. At step 7 the second episode still doesn't end, but we are unwilling to wait, so we stop collecting to update the algorithm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "pAoKe02ybG68"
   },
   "outputs": [],
   "source": [
    "obs, info = env.reset()\n",
    "for i in range(3, 10):\n",
    "    # For retrieving actions to be used for training, we set the policy to training mode,\n",
    "    # but the wrapped torch module should be in eval mode.\n",
    "    with policy_within_training_step(policy), torch_train_mode(policy, enabled=False):\n",
    "        act = policy(Batch(obs=obs[np.newaxis, :])).act.item()\n",
    "    obs_next, rew, _, truncated, info = env.step(act)\n",
    "    # pretend this episode never end\n",
    "    terminated = False\n",
    "    info[\"id\"] = i\n",
    "    dummy_buffer.add(\n",
    "        Batch(\n",
    "            obs=obs,\n",
    "            act=act,\n",
    "            rew=rew,\n",
    "            terminated=terminated,\n",
    "            truncated=truncated,\n",
    "            obs_next=obs_next,\n",
    "            info=info,\n",
    "        ),\n",
    "    )\n",
    "    obs = obs_next"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MKM6aWMucv-M"
   },
   "source": [
    "Our replay buffer looks like this now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "CSJEEWOqXdTU",
    "outputId": "2b3bb75c-f219-4e82-ca78-0ea6173a91f9"
   },
   "outputs": [],
   "source": [
    "print(dummy_buffer)\n",
    "print(f\"maxsize: {dummy_buffer.maxsize}, data length: {len(dummy_buffer)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "55VWhWpkdfEb"
   },
   "source": [
    "#### Updates\n",
    "Now we have got a replay buffer with 10 data steps in it. We can call `Policy.update()` to train.\n",
    "\n",
    "However, we need to manually set the torch module to training mode prior to that, \n",
    "and also declare that we are within a training step. Tianshou Trainers will take care of that automatically,\n",
    "but users need to consider it when calling `.update` outside of the trainer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "i_O1lJDWdeoc",
    "outputId": "b154741a-d6dc-46cb-898f-6e84fa14e5a7"
   },
   "outputs": [],
   "source": [
    "# 0 means sample all data from the buffer\n",
    "\n",
    "# For updating the policy, the policy should be in training mode\n",
    "# and the wrapped torch module should also be in training mode (unlike when collecting data).\n",
    "with policy_within_training_step(policy), torch_train_mode(policy):\n",
    "    policy.update(sample_size=0, buffer=dummy_buffer, batch_size=10, repeat=6).pprint_asdict()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QJ5krjrcbuiA"
   },
   "source": [
    "## Further Reading\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pmWi3HuXWcV8"
   },
   "source": [
    "### Pre-defined Networks\n",
    "Tianshou provides numerous pre-defined networks usually used in DRL so that you don't have to bother yourself. Check this [documentation](https://tianshou.readthedocs.io/en/master/03_api/utils/net/index.html) for details."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UPVl5LBEWJ0t"
   },
   "source": [
    "### How to compute GAE on your own?\n",
    "(Note that for this reading you need to understand the calculation of [GAE](https://arxiv.org/abs/1506.02438) advantage first)\n",
    "\n",
    "In terms of code implementation, perhaps the most difficult and annoying part is computing GAE advantage. Just now, we use the `self.compute_episodic_return()` method inherited from `BasePolicy` to save us from all those troubles. However, it is still important that we know the details behind this.\n",
    "\n",
    "To compute GAE advantage, the usage of `self.compute_episodic_return()` may go like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "D34GlVvPNz08",
    "outputId": "43a4e5df-59b5-4e4a-c61c-e69090810215"
   },
   "outputs": [],
   "source": [
    "batch, indices = dummy_buffer.sample(0)  # 0 means sampling all the data from the buffer\n",
    "returns, advantage = BasePolicy.compute_episodic_return(\n",
    "    batch=batch,\n",
    "    buffer=dummy_buffer,\n",
    "    indices=indices,\n",
    "    v_s_=np.zeros(10),\n",
    "    v_s=np.zeros(10),\n",
    "    gamma=1.0,\n",
    "    gae_lambda=1.0,\n",
    ")\n",
    "print(f\"{batch.rew=}\")\n",
    "print(f\"{batch.done=}\")\n",
    "print(f\"{returns=}\")\n",
    "print(f\"{advantage=}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the code above, we sample all the 10 data in the buffer and try to compute the GAE advantage. However, the way the returns are computed here might be a bit misleading. In fact, the last episode is unfinished, but its last step saved in the batch is treated as a terminal state, since it assumes that there are no future rewards. The episode is not terminated yet, it is truncated, so the agent could still get rewards in the future. Terminated and truncated episodes should indeed be treated differently.\n",
    "The return of a step is the (discounted) sum of the future rewards from that step until the end of the episode. \n",
    "\\begin{equation}\n",
    "R_{t}=\\sum_{t}^{T} \\gamma^{t} r_{t}\n",
    "\\end{equation}\n",
    "Thus, at the last step of a terminated episode the return is equal to the reward at that state, since there are no future states.\n",
    "\\begin{equation}\n",
    "R_{T,terminated}=r_{T}\n",
    "\\end{equation}\n",
    "\n",
    "However, if the episode was truncated the return at the last step is usually better represented by the estimated value of that state, which is the expected return from that state onwards.\n",
    "\\begin{align*}\n",
    "R_{T,truncated}=V^{\\pi}\\left(s_{T}\\right)  \\quad & \\text{or} \\quad R_{T,truncated}=Q^{\\pi}(s_{T},a_{T})\n",
    "\\end{align*}\n",
    "Moreover, if the next state was also observed (but not its reward), then an even better estimate would be the reward of the last step plus the discounted value of the next state.\n",
    "\\begin{align*}\n",
    "R_{T,truncated}=r_T+\\gamma V^{\\pi}\\left(s_{T+1}\\right)\n",
    "\\end{align*}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "h_5Dt6XwQLXV"
   },
   "source": [
    "\n",
    "As we know, we need to estimate the value function of every observation to compute GAE advantage. So in `v_s` is the value of `batch.obs`, and in `v_s_` is the value of `batch.obs_next`. This is usually computed by:\n",
    "\n",
    "`v_s = critic(batch.obs)`,\n",
    "\n",
    "`v_s_ = critic(batch.obs_next)`,\n",
    "\n",
    "where both `v_s` and `v_s_` are 10 dimensional arrays and `critic` is usually a neural network.\n",
    "\n",
    "After we've got all those values, GAE can be computed following the equation below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ooHNIICGUO19"
   },
   "source": [
    "\\begin{aligned}\n",
    "\\hat{A}_{t}^{\\mathrm{GAE}(\\gamma, \\lambda)}: =& \\sum_{l=0}^{\\infty}(\\gamma \\lambda)^{l} \\delta_{t+l}^{V}\n",
    "\\end{aligned}\n",
    "\n",
    "where\n",
    "\n",
    "\\begin{equation}\n",
    "\\delta_{t}^{V} \\quad=-V\\left(s_{t}\\right)+r_{t}+\\gamma V\\left(s_{t+1}\\right)\n",
    "\\end{equation}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "eV6XZaouU7EV"
   },
   "source": [
    "Unfortunately, if you  follow this equation, which is taken from the paper, you probably will get a slightly lower performance than you expected. There are at least 3 \"bugs\" in this equation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "FCxD9gNNVYbd"
   },
   "source": [
    "**First** is that Gym always returns you a `obs_next` even if this is already the last step. The value of this timestep is exactly 0 and you should not let the neural network estimate it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "rNZNUNgQVvRJ",
    "outputId": "44354595-c25a-4da8-b4d8-cffa31ac4b7d"
   },
   "outputs": [],
   "source": [
    "# Assume v_s_ is got by calling critic(batch.obs_next)\n",
    "v_s_ = np.ones(10)\n",
    "v_s_ *= ~batch.done\n",
    "print(f\"{v_s_=}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2EtMi18QWXTN"
   },
   "source": [
    "After the fix above, we will perhaps get a more accurate estimate.\n",
    "\n",
    "**Secondly**, you must know when to stop bootstrapping. Usually we stop bootstrapping when we meet a `done` flag. However, in the buffer above, the last (10th) step is not marked by done=True, because the collecting has not finished. We must know all those unfinished steps so that we know when to stop bootstrapping.\n",
    "\n",
    "Luckily, this can be done under the assistance of buffer because buffers in Tianshou not only store data, but also help you manage data trajectories."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "saluvX4JU6bC",
    "outputId": "2994d178-2f33-40a0-a6e4-067916b0b5c5"
   },
   "outputs": [],
   "source": [
    "unfinished_indexes = dummy_buffer.unfinished_index()\n",
    "print(\"unfinished_indexes: \", unfinished_indexes)\n",
    "done_indexes = np.where(batch.done)[0]\n",
    "print(\"done_indexes: \", done_indexes)\n",
    "stop_bootstrap_ids = np.concatenate([unfinished_indexes, done_indexes])\n",
    "print(\"stop_bootstrap_ids: \", stop_bootstrap_ids)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qp6vVE4dYWv1"
   },
   "source": [
    "**Thirdly**, there are some special indexes which are marked by done flag, however its value for obs_next should not be zero. It is again because done does not differentiate between terminated and truncated. These steps are usually those at the last step of an episode, but this episode stops not because the agent can no longer get any rewards (value=0), but because the episode is too long so we have to truncate it. These kind of steps are always marked with `info['TimeLimit.truncated']=True` in Gym."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tWkqXRJfZTvV"
   },
   "source": [
    "As a result, we need to rewrite the equation above\n",
    "\n",
    "`v_s_ *= ~batch.done`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kms-QtxKZe-M"
   },
   "source": [
    "to\n",
    "\n",
    "```\n",
    "mask = batch.info['TimeLimit.truncated'] | (~batch.done)\n",
    "v_s_ *= mask\n",
    "\n",
    "```\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "u_aPPoKraBu6"
   },
   "source": [
    "### Summary\n",
    "If you already felt bored by now, simply remember that Tianshou can help handle all these little details so that you can focus on the algorithm itself. Just call `BasePolicy.compute_episodic_return()`.\n",
    "\n",
    "If you still feel interested, we would recommend you check Appendix C in this [paper](https://arxiv.org/abs/2107.14171v2) and implementation of `BasePolicy.value_mask()` and `BasePolicy.compute_episodic_return()` for details."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2cPnUXRBWKD9"
   },
   "source": [
    "<center>\n",
    "<img src=../_static/images/timelimit.svg></img>\n",
    "</center>\n",
    "<center>\n",
    "<img src=../_static/images/policy_table.svg></img>\n",
    "</center>"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
